APItest/t/utf8_warn_base: Add tests #22646

khwilliamson · 2024-10-08T18:49:26Z

One UTF-8 malformation is when the string has a start byte in it before the expected end of the character. This test file tested the case where the unexpected byte came in the final position. GH #22597 found bugs where the undexpected byte came immediately after the first byte.

This commit adds tests for unexpected bytes in all possible positions. if the fix for GH #22597 is reverted, this new revised file has 1400 failures.

This set of changes does not require a perldelta entry.

jkeenan · 2024-10-09T22:38:27Z

The following test failed in one (but only one) of our CI configurations: ext/XS-APItest/t/utf8_warn00.t. See https://github.com/Perl/perl5/actions/runs/11241805335/job/31254645122?pr=22646. However, when I tried to reproduce this locally I could not.

I subsequently checked out this pull request and built an unthreaded perl. At the end of make test_prep I got these build-time warnings:

cc -c   -fwrapv -fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -Wall -Werror=pointer-arith -Werror=vla -Wextra -Wno-long-long -Wno-declaration-after-statement -Wc++-compat -Wwrite-strings -O2   -DVERSION=\"1.32\" -DXS_VERSION=\"1.32\" -fPIC "-I../.."   Normalize.c
In file included from ../../perl.h:7895,
                 from Normalize.xs:13:
../../inline.h: In function ‘Perl_is_utf8_string_flags’:
../../inline.h:2090:41: warning: bitwise comparison always evaluates to false [-Wtautological-compare]
 2090 |                                         == UTF8_DISALLOW_ILLEGAL_INTERCHANGE)
      |                                         ^~
../../inline.h:2096:40: warning: bitwise comparison always evaluates to false [-Wtautological-compare]
 2096 |                                        == UTF8_DISALLOW_ILLEGAL_C9_INTERCHANGE)
      |                                        ^~
../../inline.h: In function ‘Perl_is_utf8_string_loclen_flags’:
../../inline.h:2527:41: warning: bitwise comparison always evaluates to false [-Wtautological-compare]
 2527 |                                         == UTF8_DISALLOW_ILLEGAL_INTERCHANGE)
      |                                         ^~
../../inline.h:2533:37: warning: bitwise comparison always evaluates to false [-Wtautological-compare]
 2533 |                                     == UTF8_DISALLOW_ILLEGAL_C9_INTERCHANGE)
      |                                     ^~
rm -f ../../lib/auto/Unicode/Normalize/Normalize.so
cc  -shared -O2 -L/usr/local/lib -fstack-protector-strong  Normalize.o  -o ../../lib/auto/Unicode/Normalize/Normalize.so  \
      \
  
chmod 755 ../../lib/auto/Unicode/Normalize/Normalize.so
make[1]: Leaving directory '/home/jkeenan/gitwork/perl2/dist/Unicode-Normalize'
./perl -Ilib -I. -f pod/buildtoc -q
cd t && (rm -f perl; /usr/bin/ln -s ../perl perl)

'j' isn't very descriptive for this purpose

I have a better use for this name in mind; and this makes the name more specific

Perl's extended UTF-8 is capable of representing code points up to 2**72 (2**65 on EBCDIC). These won't fit in a 64 bit word, and hence overflow. (And much more so on 32 bit machines.) A start byte of \xFF is required for code points starting with 2**36; \xFE for those starting with 2**31, and so on. But it turns out that a sequence beginning with \xFE can express all code points 0..2**36-1, and \xFF sequences can express everything 0..2**72-1. When a sequence represents a code point that can be expressed by a shorter sequence, it is called an overlong, and using those is expressly forbidden by the Unicode standard due to spoofing attacks that have occurred. So, a \xFE start byte should only be used for code points in the range 2**31..2**36-1; and \xFF only for 2**36..2**65-1. But Perl needs to handle the possibility where the input doesn't match the expectations of what it should be. We have tried to determine all the malformations that apply to a given sequence and return them to the caller when requested. The interplay between overflow and overlong is somewhat tricky, and the new tests that are to be added in the next commit showed that we haven't been doing it completely right. Prior to this commit, the checks for both overlong and overflow had three states: yes, no, and maybe. The last meaning that the sequence being examined was shorter than a full character, and that some possible completions of it would result in yes, and some would result in no. This commit retains the tripartite state of examining a sequence for being overlong, but adds a fourth state for overflow, namely that the input overflows unless the sequence is overlong, and there aren't enough bytes to determine the latter absolutely for sure. But overlongs are rare, so the chances of it being that are tiny, so this state means that it almost certainly overflows. Prior to this commit, I had tried to cope with some of this by an extra parameter to the find-if-overflow function, but this fourth state removes the need for that. The caller gets which state the input is, and then chooses ow to handle it, without needing the parameter. The tests in utf8decode.t also had to be changes, as this new code picks up some overflows on 32-bit machines that were previously not caught.

One UTF-8 malformation is when the string has a start byte in it before the expected end of the character. This test file tested the case where the unexpected byte came in the final position. GH Perl#22597 found bugs where the unexpected byte came immediately after the first byte. This commit adds tests for unexpected bytes in all possible positions. If the fix for GH Perl#22597 is reverted, this new revised file has 1400 failures.

khwilliamson · 2024-10-19T13:40:13Z

#22649 fixed the build warnings; this was updated and passes CI

khwilliamson mentioned this pull request Oct 8, 2024

Some Invalid UTF-8 Sequences Cause Panic #22597

Merged

3 tasks

jkeenan mentioned this pull request Oct 9, 2024

perl_langinfo.h: Fix to work on Android #22650

Merged

khwilliamson force-pushed the nmalformed branch from 1d88d55 to f6a07be Compare October 11, 2024 15:40

khwilliamson added 5 commits October 11, 2024 11:29

APItest/t/utf8_warn_base: Change $variable name

e78b485

'j' isn't very descriptive for this purpose

utf8.c: Add comments; move decls closer to use

3aff961

utf8.c: Change #define name

c226d5d

I have a better use for this name in mind; and this makes the name more specific

khwilliamson force-pushed the nmalformed branch from f6a07be to f108033 Compare October 11, 2024 17:36

khwilliamson merged commit b2215f6 into Perl:blead Oct 19, 2024
33 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

APItest/t/utf8_warn_base: Add tests #22646

APItest/t/utf8_warn_base: Add tests #22646

khwilliamson commented Oct 8, 2024 •

edited

Loading

jkeenan commented Oct 9, 2024

khwilliamson commented Oct 19, 2024

APItest/t/utf8_warn_base: Add tests #22646

APItest/t/utf8_warn_base: Add tests #22646

Conversation

khwilliamson commented Oct 8, 2024 • edited Loading

jkeenan commented Oct 9, 2024

khwilliamson commented Oct 19, 2024

khwilliamson commented Oct 8, 2024 •

edited

Loading